The association between dietary patterns derived by three statistical methods and type 2 diabetes risk: YaHS-TAMYZ and Shahedieh cohort studies

Findings were inconsistent regarding the superiority of using recently introduced hybrid methods to derive DPs compared to widely used statistical methods like principal component analysis (PCA) in assessing dietary patterns and their association with type 2 diabetes mellitus (T2DM). We aimed to investigate the association between DPs extracted using principal component analysis (PCA), partial least-squares (PLS), and reduced-rank regressions (RRR) in identifying DPs associated with T2DM risk. The study was conducted in the context of two cohort studies accomplished in central Iran. Dietary intake data were collected by food frequency questionnaires (FFQs). DPs were derived by using PCA, PLS, and RRR methods considering. The association between DPs with the risk of T2DM was assessed using log-binomial logistic regression test. A total of 8667 participants aged 20–70 years were included in this study. In the multivariate-adjusted models, RRR-DP3 characterized by high intake of fruits, tomatoes, vegetable oils, and refined grains and low intake of processed meats, organ meats, margarine, and hydrogenated fats was significantly associated with a reduced T2DM risk (Q5 vs Q1: RR 0.540, 95% CI 0.33–0.87, P-trend = 0.020). No significant highest-lowest or trend association was observed between DPs derived using PCA or PLS and T2DM. The findings indicate that RRR method was more promising in identifying DPs that are related to T2DM risk compared to PCA and PLS methods.


Results
A total of 8667 study participants (52.5% females) had complete data to be entered to the current analysis of which 245 patients were diagnosed with T2DM after 6 years of follow-up for YaHS-TAMYZ study and 4 years for Shahedieh study. The baseline characteristics of the study population are presented in Table 1. There were significant differences between participant with and without T2DM across age categories, educational status, smoking status (P = 0.003), and total energy intake. Participants with T2DM were in higher age categories than participant without T2DM (P < 0.001). Compared to cases with T2DM, participant without T2DM were more to have high school diploma and BSc or higher academic degree (P = 0.026). In addition, participants with T2DM were more likely to be current and former smokers compared to participants without T2DM (P = 0.003). While, participants without T2DM had higher energy intake than cases with T2DM (P = 0.009). No significant difference was observed between two groups of participants for sex, marital status, BMI categories, and physical activity (P > 0.05).
The 33 dietary food groups and factor loadings for each DPs derived by PCA, PLS, and RRR methods are shown in Table 2. The first DPs derived by PCA method (PCA-DP1) was characterized by high intake of processed meats, organ meats, fish, margarine, fruit juice, pizza, snacks, sweet dessert, and soft drinks and low intake of whole grains. The PCA-DP2 was associated with high intakes of dairy products, fruits, tomatoes, other vegetables, potatoes, refined grains, and vegetable oils.in addition, and PCA-DP3 was characterized by high intake of tea, mayonnaise, nuts, hydrogenated fats, sugars, and soft drinks.
The first DPs from RRR method (RRR-DP1) was rich in whole grains and low in processed meats, red meats, poultry, fish, margarine, fruit juice, pizza, snacks, sweet dessert, and soft drinks. The RRR-DP2 was characterized primarily by high intake of poultry, fruits, soft drinks, and yoghurt drink and low intake of potatoes, refined grains, and mayonnaise; and the third DPs (RRR-DP3) was defined by high intake of fruits, fruit juice, refined grains, and vegetable oils, but low intake of processed meats, organ meats, margarine, and hydrogenated fats.
The percentage of variation explained by food groups was higher in DPs derived by PCA method (23.142%) in comparison to 19.252% of PLS-derived DPs and 13.89% for RRR-derived DPs ( Table 3).
The three DPs of PCA explained 0.324% of the response variables variation. DPs from PLS method explained 0.831% of the total variation in six response variables and the RRR-derived DPs explained 0.993%. As expected, both RRR and PLS methods explained a greater amount of variation in the response variables (Table 3). Figure 1 represents the risk of developing T2DM for each quintile of the DPs scores compared to the lowest quintile. No association was observed between three DPs from PCA method and T2DM risk in crude and all adjusted models.
The crude model of second DP derived by PLS method was inversely associated with T2DM risk (PLS-DP2 Q3 vs Q1: risk ratio (RR) 0.609, 95% confidence interval (CI) 0.39-0.94, P-trend = 0.585). In the multivariateadjusted models, PLS-DP2 method was found to be inversely associated with T2DM risk in participants in the third quintile than in people in the first quintile (Model I Q3 vs Q1: RR = 0.609, 95% CI 0. 39

Discussion
The greater adherence to a diet characterized by high intake of fruits, tomatoes, vegetable oils, and refined grains and low intake of processed meats, organ meats, margarine, and hydrogenated fats derived by RRR method was significantly associated with reduced T2DM risk. The present study also showed that the RRR method can provide a better identifying DPs that are related to T2DM risk due to the considering intermediate factors related to diseases for generating DPs. Several studies have assessed the association between DPs derived only by RRR method and T2DM. Duan et al. 20 reported that blood lipids-related DPs using the RRR method, for both men and women were characterized by high consumption of sugary beverages, juice, and added sugar; and low consumption of cereals, fruits, vegetables, nuts or seeds, and tea were significantly linked with an increased risk of T2DM. Liese et al. 21 have used the RRR method on plasminogen activator inhibitor-1 (PAI-1) and fibrinogen biomarkers to derive DPs, and they identified a DP that was predictive of T2DM which was characterized by a high intake of red meat, fried potatoes, tomato vegetables, dried beans, low-fiber bread and cereal, eggs, cheese, and low intake of wine. A nested case-control study identified a RRR-derived DPs using inflammatory biomarkers that was characterized by a high intake of processed meats, soft drinks, sugar-sweetened drinks, and refined grains, but a low intake of cruciferous and yellow vegetables, wine, and coffee that was associated with an increased T2DM risk 22 . However, the differences in the results of our study and the aforementioned study could be influenced by the difference between responses variables.
In the current study, people in second and third quintile of adherence to RRR-DP2 had lower risk of T2DM; In addition, the inverse association between adherence to PLS-DP2 and T2DM risk was observed only in modest Table 1. Baseline characteristics of study participants. T2DM type 2 diabetes mellitus, BMI body mass index, MET metabolic equivalent. a Values are mean ± standard deviation (SD) or percentage (%). b P-value are resulted from independent t-test for quantitative variables and from chi-square for qualitative variables. Our findings revealed that although PCA explains the highest variation in food groups, none of the derived DPs by this method were significantly associated with T2DM risk. This supports the view that PCA generates the diet behavior-related patterns and PCA-derived DPs could not necessarily predict the risk of diseases. Processed meats 0.237 - Table 3. Explained variation in food groups and responses using PCA, PLS, and RRR. *Food items from the food frequency questionnaire were defined as 33 separate food groups (Table 4). **Selected response variables were waist circumference, fasting blood glucose, triglycerides, low-density lipoprotein-cholesterol, highdensity lipoprotein-cholesterol, and total serum cholesterol. www.nature.com/scientificreports/ Altogether, in this study, we found more T2DM-associated DPs by using the RRR method than both PLS and PCA. In line with our results, Hoffmann et al. 13 compared three methods PCA, RRR, and PLS in association with T2DM and found that the RRR method could extract significant risk factors for T2DM. It should be considered that RRR method focuses on explaining variation in the disease-related response variables, while PLS is a method that mathematically considers both food groups and responses. This fact may explain the significant associations between RRR-derived DPs and T2DM rather than PLS-derived DPs. Moreover, in accordance with our results, some investigations demonstrated that RRR derived DPs had stronger and more statistically significant link with other outcomes than those derived using PCA and PLS 13,16,23 .

Explained variation in food groups
In line with this association, numerous previous studies support the link between consumption of fruits and vegetables and a decreased risk of T2DM. A meta-analysis of prospective studies found that T2DM risk reduced by 10% with increasing intakes of fruits up to 200-300 g/day 24 . A study by Nguyen et al. 25 showed that greater intake of fruits and vegetables are related to a lower risk of T2DM. Furthermore, a review established that the intake of fruit juices can decrease the risk of chronic diseases including T2DM 26 ; whereas, two meta-analyses did not proposed an association between fruit juice intake and T2DM risk 27,28 . The favorable effects of fruits and vegetables in the prevention of T2DM could be because of their high content of fiber, vitamins, minerals, antioxidants, and phytochemicals 29,30 . In addition, antioxidant phytochemicals contribute to the reduction of oxidative stress and inflammation 30 . For instance, it is shown that blueberries reduce blood glucose 31,32 and C reactive protein 31 and improve the insulin sensitivity 33 . Blueberries, grapes, and apples are rich in anthocyanins and quercetin [34][35][36] . Animal studies have shown that anthocyanins with anti-diabetic effects via glucose transporter 4 regulation 37 . Quercetin also has a protective role in reducing oxidative stress and beta-cell damage 38 . Moreover, the magnesium content of fruits and vegetables could improve insulin signaling 39 . It has been also demonstrated that the consumption of fruits and vegetables may reduce T2DM risk by decreasing adipose tissue and weight gain over time 29 . It is shown that tomatoes are beneficial for diabetic conditions due to reducing oxidative stress, inflammation, and tissue damages 40 . Tomatoes contain a wide range of antioxidants like lycopene, vitamins, and minerals 41 ; as a dose-response association was observed between serum lycopene levels and T2DM 42,43 .
It has been consistently shown that processed meats increase the T2DM risk in prospective studies 44,45 . A meta-analysis by Tian et al. 46 also revealed that the intake of processed meats is a risk factors for T2DM. It is conceivable that the high content of nitrates or nitrites in processed meats may increase the risk of T2DM 47 . Nitrosamine compounds in processed meats are formed during manufacturing or via interactions between nitrates and amino acids in the body 48 . It has been demonstrated that Nitrosamines have a toxic effect on β cells and can raise the T2DM risk 49,50 . Additionally, advanced glycation end products from processed meats can induce inflammatory mediators related to T2DM 51 . A growing body of evidence showed that DPs containing www.nature.com/scientificreports/ hydrogenated fats were positively associated with T2DM risk 52,53 . Trans fats are associated with an increased risk of T2DM through increasing TG levels, postprandial insulin and glucose, and reducing glucose uptake in skeletal and cardiac muscles. There are several limitations in this study that should be considered. Although FFQs are widely used to measure usual dietary exposures and considered as a valid and reproducible nutrition science tool, they are prone to possible misreporting and misclassification of study participants which might lead to weak or null relationships. Moreover, short follow-up period and the limited number of incident T2DM cases were other limitations of our study. In addition, both YaHS-TAMYZ and Shahedieh cohort studies had less than 5-year follow-up, therefore, in the present study, the long-term effects of DPs on T2DM risk might not be revealed. In general, determining the most effective method for deriving dietary patterns related to a specific disease varies according to the study goals such as study population, selected response variables, and outcome of interest. Further studies are required to examine the generalizability of DPs derived by different methods in other populations using the similar response variables.
In conclusion, the higher adherence to a diet characterized by high intake of fruits, tomatoes, vegetable oils, and refined grains and low intake of processed meats, organ meats, margarine, and hydrogenated fats was significantly associated with reduced risk of T2DM. The findings indicate that RRR method was more promising in identifying DPs that are related to T2DM risk than PCA and PLS methods. Though, future investigations are required to approve the relative advantages of the RRR method in association with T2DM and other nutritionrelated diseases.

Study design and study population. The Yazd Health Study (YaHS) was established in September 2014
in Yazd greater area located in central Iran. In this study 9962 participants aged 20-70 years were entered in the enrollment phase. The dietary intake assessment of participants was separately collected in Taghzieh Mardom Yazd (TAMYZ) study using a validated semi-quantitative food frequency questionnaire (FFQ) 54 . The Shahedieh cohort study is a part of a large Persian multicentral study (Persian cohort) conducted on 180,000 participants in 18 various geographical areas of Iran 55 . The Shahedieh study was established in 2014 and 9977 adults aged 35 to 70 years entered to the study at baseline. Participants also filled a semi-quantitative food frequency questionnaire to report their dietary intake. Information on demographic characteristics, smoking status, physical activity, medical history was also collected in both studies. The study protocol for YaHS-TAMYZ 56 and Shahedieh cohort 55 are completely described elsewhere.
Flow chart of participant's selection from YaHS-TAMYZ and Shahedieh cohort studies is showed in Fig. 2. Participants who reported an implausible total energy intake or incomplete dietary intakes data (< 800 kcal/day  www.nature.com/scientificreports/ or > 6000 kcal/day, YaHS-TAMYZ study, n = 639, Shahedieh study, n = 1709), those had not provided data on response variables (YaHS-TAMYZ study, n = 6258, Shahedieh study, n = 356), those who had a previous diagnosis of type 1 diabetes or T2DM (YaHS-TAMYZ study, n = 601, Shahedieh study, n = 1685), those who had not provided data on national identifier code (YaHS-TAMYZ study, n = 34, Shahedieh study, n = 0), and people who died (YaHS-TAMYZ study, n = 23, Shahedieh study, n = 28) were excluded, which left 8667 participants (YaHS-TAMYZ: 2468, Shahedieh: 6199) for current analyses. All participants gave an informed consent before entering the studies. Both studies were approved by the research Council of Shahid Sadoughi University of Medical Sciences. The current study was also ethically approved by Shahid Sadoughi University's ethics committee (approval code: IR.SSU.SPH. REC.1399.197). All methods of the present study were carried out according to the relevant guidelines and regulations.
Dietary assessment and food groups. Dietary intakes in the YaHS-TAMYZ study were assessed by a 178-item validated, multiple-choice semi-quantitative FFQ 54 . For each food item, participants were asked by trained interviewers to report the frequency of food item intake during the past year by answering 10-multiplechoice frequency responses ranging from "never or less than once a month" to "10 or more times per day". In addition, FFQ had five choices for portion size for estimation of the amount of each consumed food item 57 . Dietary intake information was collected by a semi-quantitative open-ended FFQ based on 134-items in the Shahedieh study. Participants of the Shahedieh study were asked to report how often on average over the previous year they consumed a typical portion size of each food item with multiple possible responses on a "daily", "weekly", or "monthly" basis. The frequency and portion size reported for food items were converted to grams per day using household measures 58 . The United States Department of Agriculture food composition database was used to estimate daily intake of energy and nutrient for each participant 59 . Food items were merged into 33 food groups based on food items similarity in their nutrient profiles and are presented in Table 4.
Assessment of other covariates. The height and body weight of the study participants were measured in both YaHS-TAMYS and Shahedieh studies. In the Shahedieh study, body weight (kg) and height (cm) were measured using the National Institute of Health protocols by trained staffs. Body weight was measured while the participants were with minimum clothing and without shoes by using a digital scale (SECA, model 755, Germany). Height was measured by using a measure tape attached to a flat wall with the accuracy of 0.5 cm. In the YaHS-TAMYS study, body weight was measured by using an Omron BF511 portable digital scale (Omron Inc. Nagoya, Japan) with the accuracy of 0.1 kg, while standing on the middle of the scale, without assistance and with minimum clothing and height was measured in a standing position using a tape measure on a straight wall to the nearest centimeter. Body mass index (BMI) was calculated as weight (kg)/height squared (m 2 ). Waist circumference (WC) was recorded to the nearest 0.5 cm by using non-stretch tape placed midway between iliac crest and lowest rib while participants were in the standing position. In addition, hip circumference was measured over the largest part of the buttocks, with an accuracy of 0.5 cm.
Data on age, gender, physical activity, education level, smoking status, marital status, and the history of chronic diseases was collected through a similar questionnaire in both cohort studies.
In the Shahedieh cohort study, participants were asked about their usual physical activity levels in the last year and in case they had seasonal jobs 60 . In the YaHS-TAMYZ cohort study, the short version of the International Physical Activity Questionnaire (IPAQ) was used to measure physical activity level of participants 61 . Physical activity was expressed as metabolic equivalent hours per week (MET-h/week) for all participants.
Age was classified into five categories (20-30, 30-40, 40-50, 50-60, and ≥ 60 years). Educational level was categorized into four levels (Uneducated, Elementary or guidance school, High school diploma, BSc or higher academic degree). Smoker participants were defined as current smokers, former smokers, and never smokers. Marital status was categorized into three categories (Single, Married, and Widowed or divorced).

Laboratory measurements.
Fasting blood glucose (FBG) (mg/dl), triglycerides (TG), low-density lipoprotein-cholesterol (LDL-c), high-density lipoprotein-cholesterol (HDL-c), and total serum cholesterol were measured in the YaHS-TAMYZ cohort study according to the standard laboratory protocol using Pars Azmoon kits and calibrated Ciba Corning (Ciba Corp, Basle, Switzerland) auto-analyzers. In Shahedieh cohort study, blood samples (25 mL) were collected from the participants after an overnight fasting (8-12 h). The blood samples were aliquoted into serum, buffy coat, and whole blood samples. FBG, TG, LDL-c, HDL-c, and total serum cholesterol were determined from the serum samples by an auto-analyzer (Analyzer BT1500) using Pars Azmoon standard kits.

Statistical analysis. Dietary patterns analysis. Three complementary data reduction techniques, includ-
ing PCA, RRR, and PLS, were used to identify DPs out of 33 food groups. In PCA method, the DPs explain as much variation as possible of the food groups. RRR method identifies linear functions of predictors (food groups) that explain as much intermediate responses variation as possible with using a covariance matrix of predictors and responses in calculating the DPs scores. The PLS method combines PCA and RRR methods and calculates DPs scores considering both the predictor and response matrices; therefore, the explained variance of both food groups and intermediate responses is expected to be between the PCA and RRR methods.
The number of DPs initially produced by PCA is constrained by the number of food groups used 62 ; However, we retained just three DPs from PCA for subsequent analysis was according to the scree plot, an eigenvalue (> 1), and the interpretability of the principal DPs 63 . Varimax rotation was applied to achieve orthogonal DPs and increase the interpretability of principle DPs. Sample adequacy was checked by using the Kaisere Mayere Olkin (KMO) test. The SAS procedure PLS were used to conduct PLS and RRR analysis, respectively. The number of DPs derived by PLS and RRR is restricted by the number of intermediate response variables used; Therefore, six DPs were specified in each method. For both methods, we calculated the continuous DPs scores (the linear functions of food groups) in the subsequent analyses and interpretations. The first three DPs obtained by PLS and RRR was retained for further analyses because these DPs explained the largest amount of variation among the response variables.

Scientific Reports
Follow-up of study participants and case confirmation. In the YaHS-TAMYZ cohort study, information on death events and T2DM incidence was collected by using data from population-based registries and linked outcome information from the aggregated hospital information system (Samanah Electronici PArvandeh Salamat-SEPAS) which covers 100% of public hospitals and the majority of private hospitals in Yazd province. The data was obtained from SEPAS using the National Identifier number of each participant to link data. www.nature.com/scientificreports/ During follow-up time in Shahedieh cohort study, participants received annual phone calls and follow-up questionnaires were completed in terms of the occurrence of death or the incidence of T2DM diagnosis. In case a participant had expired or had been diagnosed with T2DM, investigators followed the phone call with a house or hospital visit to perform a more follow-up and to collect copies of pertinent medical documents for further evaluation and recording. If needed, medical/physical examinations were performed to formulate a T2DM diagnosis. In addition, a verbal autopsy form validated in the Iranian population was completed during the death events. Two trained internists assessed the medical documents to determine the final T2DM diagnosis or cause of death. In case of inconsistency, a third internist conducted a final assessment of the documents to reach a final decision. The same follow-up procedures were followed in the case of self-reported T2DM incidence or death.
Descriptive analyses and modelling. Quantitative and qualitative variables were compared between participants who were diagnosed with and without T2DM using independent sample t-test and chi-square tests, respectively. Binomial logistic regression was used to evaluate the association between DPs derived by PCA, PLS, and RRR analyses and risk of T2DM incidence. All analyses were done in crude and three multivariable-adjusted models. The first model was adjusted for age, sex, and energy intake (Model I); Model II was additionally adjusted for education, marital status, smoking status, and physical activity; and in Model III, BMI was additionally controlled.
Comparison of dietary pattern methods. In this study, PCA, PLS and RRR methods were compared according to the relative factor loading within each DPs and its association with risk of T2DM. Additionally, these methods were evaluated based on the magnitude of variation of each method which explained the food groups and response variables.
Ethics approval and consent to participate. YaHS-TAMYZ and Shahedieh cohort studies were approved by the research Council and the ethics committee of Shahid Sadoughi University of Medical Sciences. The present study was approved by the ethics committee of Shahid Sadoughi University of Medical Sciences (Approval Code: IR.SSU.SPH.REC.1399.197). All participants gave an informed consent before entering both studies.

Data availability
The data of the current study is available from the corresponding author on reasonable request.